Introduction to Open Data Science - Course Project

About the project

I believe I heard about this course on the mailing list of the doctoral school. I expect to brush up my knowledge of some statistical methods, including PCA and factor analysis, in this course. I have used R before. This file lives in my GitHub repository.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Thu Nov 26 16:50:48 2020"

The text continues here.


Chapter 2

I decided to start with the exercises and read the book if needed. I also decided to use the Tidyverse (which includes ggplot2 and tidyr), as I use it whenever possible when working with R. The e1071 library is needed for calculating kurtosis.

Loading the data set

The data set appears to contain results of a survey conducted in an introductory university course on social statistics. The variables are three averages, each calculated over the answers to a subset of the questions; an “attitude towards statistics” value determined from answers to some questions; and the gender, age and total score of each student. According to comments in the original data set, the three derived variables indicate a “surface approach” (surf), a “strategic approach” (stra) and a “deep approach” (deep).

Below, the data is loaded and gender is recoded to a numeric value in order to make calculating a correlation matrix possible.

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidyr)
library(e1071)

df2 <- read_tsv("data/learning2014.tsv", col_types = cols(
  .default = col_double(),
  gender = col_character(),
  Age = col_integer(),
  Attitude = col_integer(),
  Points = col_integer()
))

df2 <- df2 %>% mutate(gender_ = recode(gender, M = -1, F = 1))

Graphical overview of the data

Below are the mean, median, minimum and maximum values for each of the variables, as well as the counts of students by gender:

df3 <- df2 %>% select(-c(gender, gender_))
df3 %>% summarise_all(tibble::lst(mean))
## # A tibble: 1 x 6
##   Age_mean Attitude_mean deep_mean stra_mean surf_mean Points_mean
##      <dbl>         <dbl>     <dbl>     <dbl>     <dbl>       <dbl>
## 1     25.5          31.4      3.68      3.12      2.79        22.7
df3 %>% summarise_all(tibble::lst(median))
## # A tibble: 1 x 6
##   Age_median Attitude_median deep_median stra_median surf_median Points_median
##        <dbl>           <dbl>       <dbl>       <dbl>       <dbl>         <dbl>
## 1         22              32        3.67        3.19        2.83            23
df3 %>% summarise_all(tibble::lst(min))
## # A tibble: 1 x 6
##   Age_min Attitude_min deep_min stra_min surf_min Points_min
##     <int>        <int>    <dbl>    <dbl>    <dbl>      <int>
## 1      17           14     1.58     1.25     1.58          7
df3 %>% summarise_all(tibble::lst(max))
## # A tibble: 1 x 6
##   Age_max Attitude_max deep_max stra_max surf_max Points_max
##     <int>        <int>    <dbl>    <dbl>    <dbl>      <int>
## 1      55           50     4.92        5     4.33         33
df2 %>% select(gender) %>% group_by(gender) %>% tally
## # A tibble: 2 x 2
##   gender     n
##   <chr>  <int>
## 1 F        110
## 2 M         56

Below are histograms of the values.

scores_long <- df2 %>% select(c(deep, stra, surf)) %>% gather()
ggplot(scores_long, aes(x = value, fill = key)) +
  geom_histogram(position = "dodge") +
  xlim(1, 5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing missing values (geom_bar).

ggplot(df2, aes(x = Points)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(df2, aes(x = Age)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The data look reasonably Gaussian, except for age, which clearly is not. Below, the kurtoses of the variables are calculated to give an estimate of their Gaussianity.

df2 %>% select(-c(gender, gender_)) %>% summarise_all(tibble::lst(kurtosis)) %>% t
##                         [,1]
## Age_kurtosis       3.2396418
## Attitude_kurtosis -0.4788101
## deep_kurtosis      0.6615846
## stra_kurtosis     -0.4527415
## surf_kurtosis     -0.2714614
## Points_kurtosis   -0.2625437

The excess kurtosis values for surf and Points are close to zero, and those for Attitude, deep and stra are reasonably close, so these variables seem approximately Gaussian.
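
Kurtosis alone only gauges tail weight, so a formal normality test can complement it. A minimal sketch, assuming the df3 tibble defined above:

```r
# Shapiro-Wilk test per variable; small p-values would suggest non-Gaussianity.
df3 %>%
  summarise(across(everything(), ~ shapiro.test(.x)$p.value)) %>%
  t()
```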

Below is a correlation matrix calculated from the variables. The matrix is calculated using R’s built-in implementation for Spearman’s rank correlation coefficient, which I decided to use because some of the variables are discrete.

cor(df2 %>% select(gender_, Age, Attitude, Points, surf, stra, deep), method = "spearman")
##              gender_         Age    Attitude      Points       surf        stra
## gender_   1.00000000 -0.19934006 -0.30088840 -0.09763168  0.1194377  0.14753714
## Age      -0.19934006  1.00000000 -0.00309789  0.06085512 -0.1530342  0.05065326
## Attitude -0.30088840 -0.00309789  1.00000000  0.44074335 -0.1673798  0.07755921
## Points   -0.09763168  0.06085512  0.44074335  1.00000000 -0.1409066  0.16950108
## surf      0.11943771 -0.15303419 -0.16737980 -0.14090656  1.0000000 -0.16145209
## stra      0.14753714  0.05065326  0.07755921  0.16950108 -0.1614521  1.00000000
## deep     -0.06842895  0.02059799  0.09167961 -0.01743412 -0.3209585  0.10575868
##                 deep
## gender_  -0.06842895
## Age       0.02059799
## Attitude  0.09167961
## Points   -0.01743412
## surf     -0.32095846
## stra      0.10575868
## deep      1.00000000

From the correlation matrix it can be seen that there is a moderate positive correlation between attitude and total points. With n = 166 observations, a correlation of 0.44 is in fact statistically significant, and plotting the data points with a regression line supports the relationship:

ggplot(df2, aes(x = Attitude, y = Points)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Total points against attitude score")
## `geom_smooth()` using formula 'y ~ x'

There is also a less-than-moderate negative correlation between gender and attitude, which indicates that males may have a more positive attitude to statistics.

df2 %>% group_by(gender) %>% summarise(`Median attitude` = median(Attitude))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   gender `Median attitude`
##   <chr>              <dbl>
## 1 F                   29.5
## 2 M                   34

Additionally, there is a less-than-moderate negative correlation between surf and deep.

ggplot(df2, aes(x = surf, y = deep)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("surf versus deep")
## `geom_smooth()` using formula 'y ~ x'

At this point, I decided to check the DataCamp exercises. Here is a plot of the students’ attitude versus exam points, coloured by gender, as done in the DataCamp exercise:

ggplot(df2, aes(x = Attitude, y = Points, colour = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Total points against attitude score")
## `geom_smooth()` using formula 'y ~ x'

Regression model

I could not find a simple way to do this with ggplot, so here is a solution similar to the DataCamp exercises. I chose attitude, surf and stra as the explanatory variables, as they had the highest absolute correlations with points.

model <- lm(Points ~ Attitude + surf + stra, data = df2)
summary(model)
## 
## Call:
## lm(formula = Points ~ Attitude + surf + stra, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## surf        -0.58607    0.80138  -0.731  0.46563    
## stra         0.85313    0.54159   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The results indicate (in the Estimate column) that of the three variables, stra has the largest per-unit effect on points. On the other hand, attitude is the most likely to have an effect, as seen from the probability in the last column. Since the probability of obtaining a t value this large or larger under the null hypothesis of a zero coefficient is low enough only for attitude, it seems that of the three variables only attitude may be used to predict the number of points. In effect, when attitude increases by one unit, the predicted points increase by about 0.34 units.
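
The per-unit interpretation can be checked directly from the fitted object. A small sketch, assuming the model fitted above; the two hypothetical students use illustrative values only:

```r
# The Attitude coefficient: predicted change in Points per unit of attitude.
coef(model)["Attitude"]

# Equivalently, predictions for two students identical except for attitude
# differ by exactly that coefficient.
nd <- data.frame(Attitude = c(30, 31), surf = 2.8, stra = 3.1)
diff(predict(model, newdata = nd))
```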

Graphical model validation

In general, a linear regression model assumes that the target variable is a linear combination of the model parameters. Further, it is assumed that the errors are normally distributed and have constant variance.

The fitted model assumes that total points are best explained by attitude, surf and stra; in other words, that deep would not affect the total score. (The summary of the model indicates, though, that attitude is significant while surf and stra are not.)

plot(model, c(1, 2, 5))

As the residuals shown in the Q-Q plot mostly follow the line, the errors seem to be Gaussian. The residuals vs. fitted plot shows no discernible pattern, which would indicate the errors having constant variance. The residuals vs leverage plot indicates that no single observation has exceptionally high impact on the model, as the leverage value of each observation is relatively small.
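
The graphical reading can be complemented with numeric checks; a sketch, assuming the model object from above:

```r
# Formal normality test on the residuals (complements the Q-Q plot).
shapiro.test(residuals(model))

# Largest leverage value (complements the residuals vs. leverage plot);
# values well below 1 support the reading that no observation dominates.
max(hatvalues(model))
```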


Chapter 3

Reading the data

I am again using the whole Tidyverse (which includes ggplot2, dplyr and tidyr) rather than the individual packages.

library(tidyverse)
library(tidyr)
library(questionr)
library(boot)

df0 <- read_tsv("data/student_alcohol_use.tsv", col_types = cols(
  .default = col_character(),
  age = col_integer(),
  Medu = col_integer(),
  Fedu = col_integer(),
  traveltime = col_integer(),
  studytime = col_integer(),
  failures = col_integer(),
  famrel = col_integer(),
  freetime = col_integer(),
  goout = col_integer(),
  Dalc = col_integer(),
  Walc = col_integer(),
  health = col_integer(),
  absences = col_integer(),
  G1 = col_integer(),
  G2 = col_integer(),
  G3 = col_integer(),
  alc_use = col_double(),
  high_use = col_logical()
))

Guessing the relationship between alcohol consumption and other variables

I did the DataCamp exercises but forgot what the result was. My guess, or partial recollection, is that the following are associated with high alcohol consumption:

  • Males consume more alcohol than females
  • A small amount of free time after school
  • No extra-curricular activities
  • One of the parents staying at home

Below is a summary of the data.

summary(df0)
##     school              sex                 age          address         
##  Length:382         Length:382         Min.   :15.00   Length:382        
##  Class :character   Class :character   1st Qu.:16.00   Class :character  
##  Mode  :character   Mode  :character   Median :17.00   Mode  :character  
##                                        Mean   :16.59                     
##                                        3rd Qu.:17.00                     
##                                        Max.   :22.00                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:382         Length:382         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :3.000   Median :3.000  
##                                        Mean   :2.806   Mean   :2.565  
##                                        3rd Qu.:4.000   3rd Qu.:4.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            nursery         
##  Length:382         Length:382         Length:382         Length:382        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    internet           guardian           traveltime      studytime    
##  Length:382         Length:382         Min.   :1.000   Min.   :1.000  
##  Class :character   Class :character   1st Qu.:1.000   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Median :1.000   Median :2.000  
##                                        Mean   :1.448   Mean   :2.037  
##                                        3rd Qu.:2.000   3rd Qu.:2.000  
##                                        Max.   :4.000   Max.   :4.000  
##     failures       schoolsup            famsup              paid          
##  Min.   :0.0000   Length:382         Length:382         Length:382        
##  1st Qu.:0.0000   Class :character   Class :character   Class :character  
##  Median :0.0000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.2016                                                           
##  3rd Qu.:0.0000                                                           
##  Max.   :3.0000                                                           
##   activities           higher            romantic             famrel     
##  Length:382         Length:382         Length:382         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.937  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime        goout            Dalc            Walc           health     
##  Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000  
##  Median :3.00   Median :3.000   Median :1.000   Median :2.000   Median :4.000  
##  Mean   :3.22   Mean   :3.113   Mean   :1.482   Mean   :2.296   Mean   :3.573  
##  3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##     absences          G1              G2              G3           alc_use     
##  Min.   : 0.0   Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.007  
##  1st Qu.: 1.0   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.007  
##  Median : 3.0   Median :12.00   Median :12.00   Median :12.00   Median :1.270  
##  Mean   : 4.5   Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.482  
##  3rd Qu.: 6.0   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:1.584  
##  Max.   :45.0   Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :3.296  
##   high_use      
##  Mode :logical  
##  FALSE:307      
##  TRUE :75       
##                 
##                 
## 

Below are bar charts of some of the variables, with fill indicating high alcohol use.

ggplot(df0, aes(x = sex, fill = high_use)) + geom_bar() + ggtitle("Sex vs. high alcohol use")

ggplot(df0, aes(x = freetime, fill = high_use)) + geom_bar() + ggtitle("Histogram for freetime with high alcohol use")

ggplot(df0, aes(x = activities, fill = high_use)) + geom_bar() + ggtitle("Extra-curricular activities vs. high alcohol use")

Please see below for a plot of the parents’ jobs vs. high alcohol use.

Exploring the relationships of the variables

I decided to transform the variables to numeric and calculate a correlation matrix. Here, binary variables are transformed to {-1, 1}. Other categorical measurements are transformed to (at least) ordinal if possible.

The variables Mjob, Fjob, reason and guardian are categorical and difficult to transform, so I plotted them instead.

df1 <- df0 %>% mutate(
    high_use_ = if_else(high_use, 1, -1)
  )

df2 <- df1 %>% mutate(
    school_ = recode(school, GP = -1, MS = 1),
    sex_ = recode(sex, M = -1, F = 1),
    address_ = recode(address, U = -1, R = 1),
    famsize_ = recode(famsize, LE3 = -1, GT3 = 1),
    Pstatus_ = recode(Pstatus, T = -1, A = 1),
    m_at_home = recode(Mjob, at_home = 1, .default = -1),
    f_at_home = recode(Fjob, at_home = 1, .default = -1),
    schoolsup_ = recode(schoolsup, yes = 1, no = -1),
    famsup_ = recode(famsup, yes = 1, no = -1),
    paid_ = recode(paid, yes = 1, no = -1),
    activities_ = recode(activities, yes = 1, no = -1),
    nursery_ = recode(nursery, yes = 1, no = -1),
    higher_ = recode(higher, yes = 1, no = -1),
    internet_ = recode(internet, yes = 1, no = -1),
    romantic_ = recode(romantic, yes = 1, no = -1)
  ) %>%
  select(-c(
    school,
    sex,
    address,
    famsize,
    Pstatus,
    Mjob,
    Fjob,
    reason,
    guardian,
    schoolsup,
    famsup,
    paid,
    activities,
    nursery,
    higher,
    internet,
    romantic
  ))
cor(
  x = df2 %>% select(-c(high_use, high_use_)),
  y = df2 %>% select(high_use_),
  method = "spearman"
)
##               high_use_
## age          0.05803437
## Medu         0.01693618
## Fedu         0.01417487
## traveltime   0.05407731
## studytime   -0.22656222
## failures     0.11573316
## famrel      -0.09439361
## freetime     0.13833505
## goout        0.35356847
## Dalc         0.54651084
## Walc         0.71476866
## health       0.11636775
## absences     0.17969439
## G1          -0.16759064
## G2          -0.14964691
## G3          -0.13927400
## alc_use      0.71476866
## school_     -0.03989242
## sex_        -0.27531819
## address_     0.04992972
## famsize_    -0.06782646
## Pstatus_     0.05591147
## m_at_home    0.01132896
## f_at_home   -0.10334296
## schoolsup_  -0.05838166
## famsup_     -0.09148788
## paid_        0.01650270
## activities_  0.02028154
## nursery_    -0.11566069
## higher_     -0.10779565
## internet_    0.02547942
## romantic_   -0.02488340
df3 <- df1 %>% select(high_use, Mjob, Fjob) %>% gather(key, value, -c(high_use))
ggplot(df3, aes(x = value, y = key, colour = high_use)) + geom_jitter(alpha = 0.5)

The figure does not seem to be very telling. Here, frequencies for high alcohol use are counted for Mjob and Fjob.

df1 %>% select(high_use, Mjob) %>% count(high_use, Mjob) %>% group_by(Mjob) %>% mutate(prop = prop.table(n))
## # A tibble: 10 x 4
## # Groups:   Mjob [5]
##    high_use Mjob         n  prop
##    <lgl>    <chr>    <int> <dbl>
##  1 FALSE    at_home     42 0.792
##  2 FALSE    health      26 0.788
##  3 FALSE    other      114 0.826
##  4 FALSE    services    77 0.802
##  5 FALSE    teacher     48 0.774
##  6 TRUE     at_home     11 0.208
##  7 TRUE     health       7 0.212
##  8 TRUE     other       24 0.174
##  9 TRUE     services    19 0.198
## 10 TRUE     teacher     14 0.226
df1 %>% select(high_use, Fjob) %>% count(high_use, Fjob) %>% group_by(Fjob) %>% mutate(prop = prop.table(n))
## # A tibble: 9 x 4
## # Groups:   Fjob [5]
##   high_use Fjob         n   prop
##   <lgl>    <chr>    <int>  <dbl>
## 1 FALSE    at_home     16 1     
## 2 FALSE    health      13 0.765 
## 3 FALSE    other      168 0.796 
## 4 FALSE    services    82 0.766 
## 5 FALSE    teacher     28 0.903 
## 6 TRUE     health       4 0.235 
## 7 TRUE     other       43 0.204 
## 8 TRUE     services    25 0.234 
## 9 TRUE     teacher      3 0.0968

Below, frequencies are calculated for reason and guardian, too.

df1 %>% select(high_use, reason) %>% count(high_use, reason) %>% group_by(reason) %>% mutate(prop = prop.table(n))
## # A tibble: 8 x 4
## # Groups:   reason [4]
##   high_use reason         n  prop
##   <lgl>    <chr>      <int> <dbl>
## 1 FALSE    course       111 0.793
## 2 FALSE    home          87 0.791
## 3 FALSE    other         24 0.706
## 4 FALSE    reputation    85 0.867
## 5 TRUE     course        29 0.207
## 6 TRUE     home          23 0.209
## 7 TRUE     other         10 0.294
## 8 TRUE     reputation    13 0.133
df1 %>% select(high_use, guardian) %>% count(high_use, guardian) %>% group_by(guardian) %>% mutate(prop = prop.table(n))
## # A tibble: 6 x 4
## # Groups:   guardian [3]
##   high_use guardian     n  prop
##   <lgl>    <chr>    <int> <dbl>
## 1 FALSE    father      73 0.802
## 2 FALSE    mother     220 0.8  
## 3 FALSE    other       14 0.875
## 4 TRUE     father      18 0.198
## 5 TRUE     mother      55 0.2  
## 6 TRUE     other        2 0.125

There seems to be a possible connection between choosing the school for “other” reasons and high alcohol use. Below, the corresponding correlation is calculated to make comparison easier.

cor(
  x = df1 %>% select(high_use_),
  y = df1 %>% select(reason) %>% mutate(reason = recode(reason, other = 1, .default = -1)),
  method = "spearman"
)
##               reason
## high_use_ 0.07694401

The correlation is negligible.

To conclude, there are correlations between high alcohol use and the following:

  • Positive with goout
  • Positive with being male
  • Negative with studytime

Even these are rather weak, though. Of the variables I guessed, only sex had a connection with high alcohol use according to the correlation coefficient.

Logistic regression

The way I read the exercise, I am supposed to keep using the same variables that I guessed.

Here is the model:

df4 <- df0 %>% mutate(
    m_at_home = recode(Mjob, at_home = TRUE, .default = FALSE),
    f_at_home = recode(Fjob, at_home = TRUE, .default = FALSE),
    activities = (activities == "yes")
  ) %>%
  mutate(
    either_at_home = m_at_home | f_at_home
  )

m <- glm(high_use ~ sex + freetime + activities + either_at_home, data = df4, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + freetime + activities + either_at_home, 
##     family = "binomial", data = df4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0803  -0.8047  -0.4566  -0.3717   2.2992  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -3.03111    0.53269  -5.690 1.27e-08 ***
## sexM                1.44557    0.30430   4.751 2.03e-06 ***
## freetime            0.23085    0.14067   1.641    0.101    
## activitiesTRUE     -0.06856    0.27198  -0.252    0.801    
## either_at_homeTRUE  0.19863    0.38420   0.517    0.605    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 378.39  on 381  degrees of freedom
## Residual deviance: 345.40  on 377  degrees of freedom
## AIC: 355.4
## 
## Number of Fisher Scoring iterations: 5
odds.ratio(m, level = 0.95)
## Waiting for profiling to be done...
##                          OR    2.5 % 97.5 %         p    
## (Intercept)        0.048262 0.016312 0.1323 1.269e-08 ***
## sexM               4.244258 2.379180 7.8877 2.029e-06 ***
## freetime           1.259676 0.959057 1.6672    0.1008    
## activitiesTRUE     0.933734 0.547472 1.5948    0.8010    
## either_at_homeTRUE 1.219728 0.553964 2.5322    0.6052    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model does not look convincing at all (only the student being male seems significant). Furthermore, the confidence intervals of the odds ratios of all the other variables contain the value one, which suggests that they are independent of high alcohol use. To fix this, I chose sex and study time as the explanatory variables.

m1 <- glm(high_use ~ sex + studytime, data = df4, family = "binomial")
summary(m1)
## 
## Call:
## glm(formula = high_use ~ sex + studytime, family = "binomial", 
##     data = df4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0033  -0.8096  -0.4519  -0.3515   2.3727  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.1851     0.4569  -2.594  0.00949 ** 
## sexM          1.2833     0.3047   4.211 2.54e-05 ***
## studytime    -0.5227     0.1900  -2.751  0.00594 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 378.39  on 381  degrees of freedom
## Residual deviance: 340.03  on 379  degrees of freedom
## AIC: 346.03
## 
## Number of Fisher Scoring iterations: 5
odds.ratio(m1, level = 0.95)
## Waiting for profiling to be done...
##                  OR   2.5 % 97.5 %         p    
## (Intercept) 0.30572 0.12350 0.7440  0.009487 ** 
## sexM        3.60860 2.01699 6.6993 2.538e-05 ***
## studytime   0.59294 0.40259 0.8493  0.005936 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary indicates that the student being male increases the log odds of high alcohol use by approximately 1.28, while each unit increase in study time decreases the log odds by approximately 0.52. The p values (in the last column) indicate that both effects are significant at the 0.01 level. Additionally, the confidence intervals of the odds ratios of both variables exclude the value one, so both variables appear to be connected to high alcohol use.
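
The coefficients can also be read on the probability scale. A sketch, assuming the m1 model from above; the two example students are hypothetical, with study time set to its median of 2:

```r
# Predicted probability of high alcohol use for a male and a female student
# with the same (median) study time.
new_students <- data.frame(sex = c("M", "F"), studytime = c(2, 2))
predict(m1, newdata = new_students, type = "response")
```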

Below the predictive power of the model is tested.

probabilities <- predict(m1, type = "response")
df4 <- df4 %>%
  mutate(probability = probabilities) %>%
  mutate(prediction = (0.5 < probability))

table(high_use = df4$high_use, prediction = df4$prediction)
##         prediction
## high_use FALSE
##    FALSE   307
##    TRUE     75

For some reason the model did not seem to work at all: its performance is equivalent to simply guessing that none of the students are high alcohol users. The training error is 75 / 382 ≈ 0.196.
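
The training error quoted above can be computed directly from the data frame; a sketch, assuming the prediction column added to df4 above:

```r
# Proportion of misclassified students (training error).
mean(df4$high_use != df4$prediction)

# Baseline for comparison: always predicting FALSE misclassifies exactly
# the high users, giving the same rate, 75 / 382.
mean(df4$high_use)
```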

Below 10-fold cross-validation is tested.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

cv <- cv.glm(data = df4, cost = loss_func, glmfit = m1, K = 10)
cv$delta
## [1] 0.1963351 0.1963351

On one hand, the prediction error is in fact lower than in DataCamp’s model. On the other hand, the usefulness of such a model (or of the metric) seems questionable, because simply guessing that no one is a high user yields the same error rate.


Chapter 4

Loading the Data

Below the Boston data is loaded. The data set contains various statistics from some 30 years ago including per capita crime rate by town, nitrogen oxides concentration (parts per 10 million) and pupil-teacher ratio by town. The variables seem continuous, so I used the Pearson correlation coefficient.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.84 loaded
library(e1071)
library(tidyverse) # Load last in order to use dplyr::select by default.
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
data("Boston")

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
corrplot.mixed(cor(Boston), tl.pos = "d", tl.col = "black", tl.cex = 0.6, number.cex = 0.6)

Some correlations may be seen, including:

  • strong positive for accessibility to radial highways (rad) and full-value property-tax rate per $10,000 (tax)
  • fairly strong positive for nitrogen oxides concentration, parts per 10 million (nox) and proportion of non-retail business acres per town (indus)
  • fairly strong negative for nitrogen oxides concentration, parts per 10 million (nox) and weighted mean of distances to five Boston employment centres (dis)

Below the kurtoses of the variables are calculated to determine whether they are Gaussian.

Boston %>% summarise_all(tibble::lst(kurtosis)) %>% t
##                         [,1]
## crim_kurtosis    36.59581589
## zn_kurtosis       3.95238731
## indus_kurtosis   -1.24019490
## chas_kurtosis     9.48197035
## nox_kurtosis     -0.08741064
## rm_kurtosis       1.84183241
## age_kurtosis     -0.97802966
## dis_kurtosis      0.45759158
## rad_kurtosis     -0.87892910
## tax_kurtosis     -1.15031761
## ptratio_kurtosis -0.30480100
## black_kurtosis    7.10371496
## lstat_kurtosis    0.46281705
## medv_kurtosis     1.45098366

From the values, it may be seen that some of the variables are too peaked (leptokurtic) to be Gaussian, including the per capita crime rate by town (crim) and the ethnicity variable (black). Others are too flat (platykurtic), including the proportion of non-retail business acres per town (indus) and the full-value property-tax rate per $10,000 (tax).
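
The extremes can be picked out programmatically; a minimal sketch, assuming the Boston data and the e1071 and Tidyverse packages loaded above (the |kurtosis| > 2 cut-off is an arbitrary illustration, not a formal criterion):

```r
# Keep only variables whose excess kurtosis is far from zero.
k <- Boston %>% summarise_all(tibble::lst(kurtosis)) %>% t()
k[abs(k[, 1]) > 2, , drop = FALSE]
```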

Below, the data is standardised as in the DataCamp exercise. After scaling, each variable has zero mean and unit standard deviation. The crime rate is also converted to a categorical variable.

boston_ <- Boston %>% scale %>% data.frame %>% tibble
crime_bins <- quantile(boston_$crim)
crime <- cut(boston_$crim, breaks = crime_bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_$crim = crime

Linear Discriminant Analysis

Below, training and test data sets are created by randomly assigning 80% of the observations to the former.

n <- nrow(boston_)
idxs <- sample(n, size = n * 0.8)
training <- boston_[idxs,]
testing <- boston_[-idxs,]
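Since sample() draws at random, the split (and hence the confusion matrix further down) changes between runs; seeding the RNG first makes it reproducible. A minimal sketch with stand-in data:

```r
set.seed(42)  # any fixed seed makes the 80/20 split reproducible
df <- data.frame(x = rnorm(100))          # stand-in for boston_
n <- nrow(df)
idxs <- sample(n, size = floor(n * 0.8))  # row indices for the training set
train <- df[idxs, , drop = FALSE]
test  <- df[-idxs, , drop = FALSE]
c(nrow(train), nrow(test))  # 80 and 20
```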

Next, linear discriminant analysis is done.

model <- lda(crim ~ ., data = training)
classes <- as.numeric(training$crim)
plot(model, dimen = 2, col = classes, pch = classes)

Below, the model is used to predict the crime rates in the testing data set.

actual_classes <- testing$crim
testing <- testing %>% select(-crim)
prediction <- predict(model, newdata = testing)
table(correct = actual_classes, predicted = prediction$class)
##           predicted
## correct    low med_low med_high high
##   low       13      11        1    0
##   med_low    3       8       11    0
##   med_high   0       8       20    3
##   high       0       0        1   23

By looking at the cross tabulation it seems that the model can predict crime rates quite well.
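The prediction quality can also be quantified as overall accuracy: correct predictions (the diagonal) over all predictions. A sketch using the confusion matrix printed above (note that the matrix varies with the random split):

```r
# Confusion matrix as printed above (rows = correct, columns = predicted).
tab <- matrix(c(13, 11,  1,  0,
                 3,  8, 11,  0,
                 0,  8, 20,  3,
                 0,  0,  1, 23),
              nrow = 4, byrow = TRUE)
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 2)  # 0.63; most errors are between adjacent rate categories
```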

Clustering

Below, the Euclidean distance matrix is calculated and the clustering is done. R's kmeans() computes distances internally, so the precalculated distance matrix is not actually used here.

boston_ <- Boston %>% scale %>% as.data.frame %>% as_tibble
dist_mat <- dist(boston_)
summary(dist_mat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
km1 <- kmeans(boston_, centers = 4)
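The precomputed distances are, however, exactly what hierarchical clustering consumes. A sketch of that alternative with stand-in data (for the real case, pass dist_mat instead):

```r
set.seed(1)
m <- matrix(rnorm(50 * 4), ncol = 4)       # stand-in for the scaled Boston data
hc <- hclust(dist(m), method = "ward.D2")  # agglomerative clustering on a distance matrix
clusters <- cutree(hc, k = 4)              # cut the dendrogram into 4 clusters
table(clusters)
```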

do_plot <- function (data, km) {
  ggpairs(
    data,
    mapping = ggplot2::aes(colour = as.factor(km$cluster)),
    upper = list(continuous = wrap("cor", size = 2.0)),
    lower = list(continuous = wrap("points", alpha = 0.3, size = 0.3)),
    diag = list(continuous = wrap("densityDiag", size = 0.3)),
    combo = wrap("dot", alpha = 0.4, size = 0.4)
  ) +
  theme(
    axis.text.x = element_text(size = 6),
    axis.text.y = element_text(size = 6),
    axis.ticks = element_line(colour = "black", size = 0.3),
    text = element_text(size = 10)
  )
}

do_plot(boston_, km1)
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'combo' are being
## ignored. If these are meant to be aesthetics, submit them using the 'mapping'
## variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## Warning in cor(x, y): the standard deviation is zero
## (this warning is repeated for several of the panels)

Just by looking at the plots, none of the variable pairs shows four discernible clusters. (I think dimensionality reduction might be a good idea.) Below, the total within-cluster sum of squares (TWCSS) is calculated to determine a good number of clusters.

set.seed(43)
max_clusters <- 10
twcss <- sapply(1:max_clusters, function(k) { kmeans(boston_, centers = k)$tot.withinss})
df1 <- tibble(Clusters = 1:max_clusters, TWCSS = twcss)
ggplot(df1, aes(x = Clusters, y = TWCSS)) + geom_line() + geom_point() + scale_x_continuous(breaks = 1:max_clusters)

Based on the sharp drop in TWCSS from one to two clusters, two clusters seem the best choice.
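The elbow can also be located programmatically as the largest single drop in TWCSS. A sketch with illustrative values (for the real case, use the twcss vector computed above):

```r
# Illustrative TWCSS values for k = 1..10; the real vector is computed above.
twcss <- c(6500, 4000, 3300, 2900, 2600, 2400, 2250, 2150, 2050, 2000)
drops <- -diff(twcss)  # decrease in TWCSS when adding one more cluster
which.max(drops) + 1   # number of clusters right after the largest drop: 2
```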

km2 <- kmeans(boston_, centers = 2)
do_plot(boston_, km2)
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'combo' are being
## ignored. If these are meant to be aesthetics, submit them using the 'mapping'
## variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## Warning in cor(x, y): the standard deviation is zero
## (this warning is repeated for several of the panels)

K-means with LDA

The data set was already standardised for clustering. Below, k-means is run with three clusters and LDA is used to generate a model. The original function for drawing the arrows for the LDA is from a Stack Overflow answer.

km3 <- kmeans(boston_, centers = 3)
boston__ <- boston_ %>% mutate(Cluster = km3$cluster)
model2 <- lda(Cluster ~ ., data = boston__)

lda.arrows <- function(x, x0, y0, myscale = 1, tex = 0.75, choices = c(1,2), ...){
  ## adds `biplot` arrows to an lda using the discriminant function values
  heads <- coef(x)
  arrows(x0 = x0, y0 = y0, 
         x1 = myscale * (x0 + heads[,choices[1]]), 
         y1 = myscale * (y0 + heads[,choices[2]]), ...)
  text(x = myscale * (x0 + heads[,choices[1]]), y = myscale * (y0 + heads[,choices[2]]), labels = row.names(heads), 
    cex = tex)
}

plot(model2, dimen = 2, col = boston__$Cluster, pch = boston__$Cluster)
lda.arrows(model2, x0 = rep(1:5 - 2, 3)[1:14], y0 = rep(0:2, 5)[1:14], myscale = 1, length = 0.1)

Based on the plots, it seems that the following are (some of) the most influential linear separators for the clusters:

  • index of accessibility to radial highways (rad)
  • full-value property-tax rate per $10,000 (tax)
  • proportion of owner-occupied units built prior to 1940 (age)
  • proportion of residential land zoned for lots over 25,000 sq.ft. (zn)

3D Plots for Crime Rate

The two R chunks below contain the code from the exercise (slightly modified), as well as the plots. I assume the same model predictors are supposed to be used with k-means. To determine a suitable k, I calculated the TWCSS again.

model_predictors <- dplyr::select(training, -crim)

max_clusters <- 6
twcss2 <- sapply(1:max_clusters, function(k) { kmeans(model_predictors, centers = k)$tot.withinss})
df2 <- tibble(Clusters = 1:max_clusters, TWCSS = twcss2)
ggplot(df2, aes(x = Clusters, y = TWCSS)) + geom_line() + geom_point() + scale_x_continuous(breaks = 1:max_clusters)

Removing the crime rate did not seem to notably affect the TWCSS. Since there are four crime-rate categories, I used four centers for k-means.

# Check the dimensions
dim(model_predictors)
## [1] 404  13
dim(model$scaling)
## [1] 13  3
# Matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% model$scaling
matrix_product <- as.data.frame(matrix_product)

km4 <- kmeans(model_predictors, centers = 4)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', size = 0.3, alpha = 0.3, color = training$crim)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', size = 0.3, alpha = 0.3, color = as.factor(km4$cluster))

In this dimensionality reduction, there is one clearly separate cluster around the point x = 7, y = 0, z = 0; it is better separated in the first plot. K-means assigned some points of the larger cluster together with that separate group. In both plots, the larger cluster seems to consist of three parts whose shapes match between the plots.


Chapter 5

Loading the Data and Graphical Overview

I saved the data in an rds file. Below, the data is loaded and the name of one variable clarified.

library(tidyverse)
library(GGally)
library(e1071)
human <- read_rds("data/human.rds")
ggpairs(
  human,
  upper = list(continuous = wrap("cor", size = 3.0)),
  lower = list(continuous = wrap("points", alpha = 0.3, size = 0.3)),
  diag = list(continuous = wrap("densityDiag", size = 0.3)),
  combo = wrap("dot", alpha = 0.4, size = 0.4)
) +
theme(
  axis.text.x = element_text(size = 6),
  axis.text.y = element_text(size = 6),
  axis.ticks = element_line(colour = "black", size = 0.3),
  text = element_text(size = 7)
)
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'combo' are being
## ignored. If these are meant to be aesthetics, submit them using the 'mapping'
## variable within ggpairs with ggplot2::aes or ggplot2::aes_string.

Some of the strong correlations include:

  • Negative between life expectancy at birth and maternal mortality ratio
  • Positive between life expectancy at birth and expected years of education
  • Positive between adolescent birth rate and maternal mortality ratio

Below, the excess kurtoses of the variables are calculated to assess how far they deviate from a Gaussian.

human %>% summarise_all(tibble::lst(kurtosis)) %>% t
##                                                       [,1]
## Secondary education ratio F/M_kurtosis           0.5450204
## Labour force participation ratio F/M_kurtosis    0.0498217
## Expected_Years_of_Education_kurtosis            -0.3438440
## Life_Expectancy_at_Birth_kurtosis               -0.1496657
## Gross_National_Income_GNI_per_Capita_kurtosis    6.8292110
## Maternal_Mortality_Ratio_kurtosis                4.1565597
## Adolescent_Birth_Rate_kurtosis                   0.8900140
## Percent_Representation_in_Parliament_F_kurtosis -0.1031777

Especially GNI per capita and the maternal mortality ratio are far too peaked to be Gaussian.
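A log transform is a common remedy for such right-skewed, peaked variables. A quick check of the idea with a lognormal stand-in for GNI per capita (simulated data, not the actual human data set):

```r
library(e1071)  # kurtosis() reports excess kurtosis

set.seed(2)
gni_like <- rlnorm(1000, meanlog = 9, sdlog = 1)  # lognormal stand-in for GNI
k_raw <- kurtosis(gni_like)       # strongly positive: peaked, heavy right tail
k_log <- kurtosis(log(gni_like))  # near zero: the log is exactly Gaussian here
c(k_raw, k_log)
```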

Principal Component Analysis

Below, PCA is first done on the non-standardised data.

pca1 <- prcomp(human)
biplot(pca1, choices = 1:2, cex = c(0.5, 0.8), main = "PCA done without scaling the data first")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Below, the standardisation is done first.

pca2 <- prcomp(scale(human))
biplot(pca2, choices = 1:2, cex = c(0.5, 0.8), main = "PCA done after scaling the data")

The variable names are self-explanatory. The partially obfuscated variable names on the left are Expected_Years_of_Education, GNI per capita, Secondary education ratio F/M, and Life_Expectancy_at_Birth.

Since principal components are linear combinations of the variables that capture as much of the variance of the input data as possible, not scaling the variables lets those with naturally broad scales dominate. The second set of results is therefore more informative.
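The variance claim can be checked numerically: summary() on a prcomp object reports the proportion of variance per component. A sketch on stand-in data (for the real case, pass scale(human)):

```r
set.seed(3)
m <- matrix(rnorm(100 * 5), ncol = 5)  # stand-in for scale(human)
pca <- prcomp(m)
# Proportion of total variance captured by each principal component;
# the proportions are non-increasing and sum to 1.
props <- summary(pca)$importance["Proportion of Variance", ]
round(sum(props), 2)  # 1
```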

According to the plot, the first principal component consists of the following variables:

  • Expected years of education
  • GNI per capita
  • Secondary education ratio
  • Life expectancy at birth
  • Maternal mortality ratio
  • Adolescent birth rate

The first four are positively correlated with each other, as are the latter two. Variables in one of the groups are negatively correlated with variables in another. The second principal component consists of labour force participation ratio F/M and percent representation in parliament F.

Multiple Correspondence Analysis

Below the tea data is loaded from FactoMineR. A summary of the variables is also shown.

library(FactoMineR)
data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 

I decided to keep the first 18 variables for the MCA and to leave out the rest, including age, which is not a factor.

mca <- MCA(tea[, 1:18], graph = FALSE)
plot(mca, invisible = c("ind"), habillage = "quali")

Since the plot places similar categories close together, some observations about the factors may be made:

  • Green tea is similar to dinner
  • Upscale (w.r.t. price) unpackaged tea is similar to tea room
  • Private label and branded tea (w.r.t. price) is similar to tea bag and chain store
  • Earl Grey is similar to milk

Such similarity can indicate, for example, that milk is added to Earl Grey more often than to other tea blends.
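Such readings can be sanity-checked with a plain cross tabulation, here blend against additions (assuming the tea data from FactoMineR, as loaded above):

```r
library(FactoMineR)
data(tea)
# Row-wise proportions: share of each addition within each blend.
# If the MCA reading holds, "milk" should be relatively common for Earl Grey.
prop_tab <- prop.table(table(tea$Tea, tea$How), margin = 1)
round(prop_tab, 2)
```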